
2.4.4 Distribution Rectification Distillation

Inner-level optimization. We first detail the maximization of self-information entropy. According to the definition of self-information entropy, $H(q^S)$ can be implicitly expanded as:
\[
H(q^S) = -\sum_{q_i^S \in q^S} p(q_i^S) \log p(q_i^S).
\tag{2.30}
\]
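As a purely illustrative sketch (not part of the Q-DETR implementation), the discrete form of Eq. (2.30) can be estimated from a histogram of the student query values; the function name and bin count below are hypothetical choices.
\begin{lstlisting}[language=Python]
import numpy as np

def histogram_entropy(q_s: np.ndarray, num_bins: int = 256) -> float:
    """Estimate the self-information entropy H(q^S) from a value histogram."""
    counts, _ = np.histogram(q_s.ravel(), bins=num_bins)
    p = counts / counts.sum()              # empirical probabilities p(q_i^S)
    p = p[p > 0]                           # drop empty bins so log is defined
    return float(-(p * np.log(p)).sum())   # H(q^S) = -sum_i p(q_i^S) log p(q_i^S)

# Example: 300 queries of dimension 256 drawn from a roughly Gaussian distribution.
q_s = np.random.randn(300, 256).astype(np.float32)
print(histogram_entropy(q_s))
\end{lstlisting}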

However, an explicit form of $H(q^S)$ can only be parameterized with a regular distribution $p(q_i^S)$. Luckily, the statistical results in Fig. 2.8 show that the query distribution tends to follow a Gaussian distribution, as also observed in [136]. This enables us to solve the inner-level optimization in a distribution-alignment fashion. To this end, we first calculate the mean $\mu(q^S)$ and variance $\sigma(q^S)$ of the query $q^S$, whose distribution is then modeled as $q^S \sim \mathcal{N}(\mu(q^S), \sigma(q^S))$. The self-information entropy of the student query can then be written as:

\[
\begin{aligned}
H(q^S) &= -\mathbb{E}\big[\log \mathcal{N}(\mu(q^S), \sigma(q^S))\big] \\
&= -\mathbb{E}\Big[\log\Big[\big(2\pi\sigma(q^S)^2\big)^{-\frac{1}{2}} \exp\Big(-\tfrac{(q_i^S - \mu(q^S))^2}{2\sigma(q^S)^2}\Big)\Big]\Big] \\
&= \tfrac{1}{2}\log 2\pi e\,\sigma(q^S)^2.
\end{aligned}
\tag{2.31}
\]

The above objective reaches its maximum of $H(q^S) = \frac{1}{2}\log 2\pi e\,[\sigma(q^S)^2 + \epsilon_{q^S}]$ when $\hat{q}^S = [q^S - \mu(q^S)]/\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}$, where $\epsilon_{q^S} = 10^{-5}$ is a small constant added to prevent a zero denominator. The mean and variance might be inaccurate in practice due to query data bias. To solve this, we borrow the concepts of batch normalization (BN) [207, 102]: a learnable shifting parameter $\beta_{q^S}$ is added to move the mean value, and a learnable scaling parameter $\gamma_{q^S}$ is multiplied to move the query to an adaptive position. In this situation, we rectify the information entropy of the query in the student as follows:

\[
\hat{q}^S = \frac{q^S - \mu(q^S)}{\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}}\,\gamma_{q^S} + \beta_{q^S},
\tag{2.32}
\]

in which case the maximum self-information entropy of student query becomes H(qS) =

(1/2) log 2πe[(σ2

qS + ϵqS)2

qS]. Therefore, in the forward propagation, we can obtain the

current optimal query qSvia Eq. (2.32), after which, the upper-level optimization is further

executed as detailed in the following contents.
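For concreteness, the inner-level rectification of Eqs. (2.31)-(2.32) can be sketched as a small PyTorch module. This is a hedged re-implementation under the assumptions stated above (per-dimension statistics over the query set, $\epsilon_{q^S} = 10^{-5}$), not the authors' released code; the names QueryRectifier and gaussian_entropy are hypothetical.
\begin{lstlisting}[language=Python]
import math
import torch
import torch.nn as nn

class QueryRectifier(nn.Module):
    """Sketch of Eq. (2.32): rectify student queries with learnable scale and shift."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scaling gamma_{q^S}
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shifting beta_{q^S}

    def forward(self, q_s: torch.Tensor) -> torch.Tensor:
        # q_s: (num_queries, dim) student queries
        mu = q_s.mean(dim=0, keepdim=True)                   # mu(q^S)
        var = q_s.var(dim=0, unbiased=False, keepdim=True)   # sigma(q^S)^2
        q_hat = (q_s - mu) / torch.sqrt(var + self.eps)      # normalize the query
        return q_hat * self.gamma + self.beta                # Eq. (2.32)

def gaussian_entropy(var: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Closed-form Gaussian entropy 0.5 * log(2*pi*e*(sigma^2 + eps)), cf. Eq. (2.31)."""
    return 0.5 * torch.log(2 * math.pi * math.e * (var + eps))

# Usage: rectify 300 decoder queries of dimension 256 in the forward pass.
rectifier = QueryRectifier(dim=256)
q_hat = rectifier(torch.randn(300, 256))
\end{lstlisting}
With $\gamma_{q^S}$ initialized to one and $\beta_{q^S}$ to zero, the module starts as plain standardization and learns to shift the query statistics during distillation.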

Upper-level optimization. We continue with the minimization of the conditional information entropy between the student and the teacher. Following DETR [31], we denote the ground-truth labels by $y^{GT} = \{c_i^{GT}, b_i^{GT}\}_{i=1}^{N_{gt}}$, a set of ground-truth objects in which $N_{gt}$ is the number of foregrounds, and $c_i^{GT}$ and $b_i^{GT}$ respectively represent the class and coordinates (bounding box) of the $i$-th object. In DETR, each query is associated with an object. Therefore, we can obtain $N$ objects for the teacher and the student as well, denoted as $y^S = \{c_j^S, b_j^S\}_{j=1}^{N}$ and $y^T = \{c_j^T, b_j^T\}_{j=1}^{N}$.
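For illustration only, the label and prediction sets above can be held in a simple container; the field names and shapes (91 classes, 300 queries) are assumptions rather than fixed parts of the method.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass
import torch

@dataclass
class ObjectSet:
    """A set of objects {(c_i, b_i)}: classes plus bounding boxes."""
    classes: torch.Tensor  # (N,) ground-truth labels or (N, num_classes) logits
    boxes: torch.Tensor    # (N, 4) normalized bounding-box coordinates

# Ground truth with N_gt foreground objects, and N query-wise predictions
# from the student and the teacher (one object per query, as in DETR).
y_gt = ObjectSet(classes=torch.randint(0, 91, (5,)), boxes=torch.rand(5, 4))
y_s = ObjectSet(classes=torch.randn(300, 91), boxes=torch.rand(300, 4))
y_t = ObjectSet(classes=torch.randn(300, 91), boxes=torch.rand(300, 4))
\end{lstlisting}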

The minimization of the conditional information entropy requires the student and teacher objects to be in a one-to-one matching. However, this is problematic for DETR, due primarily to the sparsity of the prediction results and the instability of the query predictions [129]. To solve this, we propose a foreground-aware query matching that rectifies “well-matched” queries. Concretely, we match the ground-truth bounding boxes against the student predictions to find the maximum coincidence as:

\[
G_i = \max_{1 \le j \le N} \mathrm{GIoU}\big(b_i^{GT}, b_j^S\big),
\tag{2.33}
\]
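The matching of Eq. (2.33) can be sketched with torchvision's pairwise GIoU, assuming the boxes have been converted to (x1, y1, x2, y2) format; the function name foreground_coincidence is hypothetical.
\begin{lstlisting}[language=Python]
import torch
from torchvision.ops import generalized_box_iou

def foreground_coincidence(b_gt: torch.Tensor, b_s: torch.Tensor):
    """Eq. (2.33): for each ground-truth box, the best GIoU over all student boxes.

    b_gt: (N_gt, 4) and b_s: (N, 4), both in (x1, y1, x2, y2) format.
    Returns G_i (max GIoU per ground-truth box) and the index of the matched query.
    """
    giou = generalized_box_iou(b_gt, b_s)   # (N_gt, N) pairwise GIoU matrix
    return giou.max(dim=1)                  # max over 1 <= j <= N

# Usage with dummy boxes (for illustration only).
b_gt = torch.tensor([[0.1, 0.1, 0.4, 0.5], [0.5, 0.5, 0.9, 0.9]])
xy1 = torch.rand(300, 2) * 0.5
b_s = torch.cat([xy1, xy1 + torch.rand(300, 2) * 0.5 + 0.01], dim=1)
G, matched_idx = foreground_coincidence(b_gt, b_s)
\end{lstlisting}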